Your mission

Perform text analysis.

Okay, I need more information.

Perform sentiment analysis or topic modeling using text analysis methods as demonstrated in the pre-class work and in the readings.

Okay, I need even more information.

Do the above. Can’t think of a data source?

  • gutenbergr
  • AssociatedPress from the topicmodels package
  • NYTimes or USCongress from the RTextTools package
  • Harry Potter Complete 7 Books text:

```
if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}

devtools::install_github("bradleyboehmke/harrypotter")
```

  • [State of the Union speeches](https://pradeepadhokshaja.wordpress.com/2017/03/31/scraping-the-web-for-presdential-inaugural-addresses-using-rvest/)
  • Scrape tweets using [`twitteR`](https://www.credera.com/blog/business-intelligence/twitter-analytics-using-r-part-1-extract-tweets/)

Analyze the text for sentiment OR topic. You do not need to do both. The DataCamp courses and Tidy Text Mining with R are good starting points for templates to perform this type of analysis, but feel free to expand beyond these examples.

Timelines and Task

We will spend the next 2 weeks working on analyzing textual data in R. You will do the following:

Gather data from GitHub and store the Harry Potter Complete 7 Books text in a data frame.

# sevenbook is a tidy text format dataframe including 7 novels
sevenbook
## # A tibble: 409,338 x 4
##    chapter      word              title series
##      <int>     <chr>              <chr>  <int>
##  1       1       boy philosophers_stone      1
##  2       1     lived philosophers_stone      1
##  3       1   dursley philosophers_stone      1
##  4       1    privet philosophers_stone      1
##  5       1     drive philosophers_stone      1
##  6       1     proud philosophers_stone      1
##  7       1 perfectly philosophers_stone      1
##  8       1    normal philosophers_stone      1
##  9       1    people philosophers_stone      1
## 10       1    expect philosophers_stone      1
## # ... with 409,328 more rows
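The code that builds `sevenbook` is not shown. As a rough sketch of the usual tidytext approach (tokenize, then drop stop words), the example below uses a made-up two-chapter character vector, `toy_book`, in place of the harrypotter package data. The object names and the toy text are assumptions for illustration, not the original code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for one book: one string per chapter
toy_book <- c(
  "The boy who lived on Privet Drive was perfectly normal.",
  "Harry looked at the owl and the owl looked at Harry."
)

toy_tidy <- tibble(chapter = seq_along(toy_book), text = toy_book) %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  mutate(title = "toy_book", series = 1L) # label the book, as in sevenbook

toy_tidy
```

Repeating this for each of the seven books and stacking the results with `bind_rows()` would give a table with the same four columns as `sevenbook`.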

1. Common words Analysis

** 1.1 What are the top words in each book? **

# Top 10 words in each novel
top_words
## # A tibble: 70 x 3
## # Groups:   title [7]
##                 title      word     n
##                 <chr>     <chr> <int>
##  1 chamber_of_secrets     harry  1503
##  2 chamber_of_secrets       ron   650
##  3 chamber_of_secrets  hermione   289
##  4 chamber_of_secrets    malfoy   202
##  5 chamber_of_secrets  lockhart   197
##  6 chamber_of_secrets professor   190
##  7 chamber_of_secrets   weasley   157
##  8 chamber_of_secrets    looked   155
##  9 chamber_of_secrets      time   148
## 10 chamber_of_secrets      eyes   145
## # ... with 60 more rows
# Plot the bar chart of top words
graph_top

# From the bar charts, we find that the main characters are Harry, Ron and Hermione.
# The most common words are usually characters' names.
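The code behind `top_words` is not shown. One common way to get the top words per book is to count words within each title and keep the highest counts in each group. The sketch below does this on a small invented table, `toy_words`, and keeps the top 2 rather than the top 10; the data and object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature of the tidy table: one row per word occurrence
toy_words <- tibble(
  title = c(rep("book_a", 6), rep("book_b", 5)),
  word  = c("harry", "harry", "harry", "ron", "ron", "owl",
            "harry", "harry", "snape", "snape", "wand")
)

top_toy <- toy_words %>%
  count(title, word, sort = TRUE) %>%         # frequency of each word per book
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # keep the 2 most frequent per book
  ungroup()

top_toy
```

To exclude characters' names, a `filter(!word %in% c("harry", "ron", ...))` step with a hand-written name list could be placed before `count()`.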

** 1.2 What are common words in the series after removing characters’ names? **

# After removing the characters' names, plot the top 10 words in each novel

no_char_graph

# The bar charts show that "looked", "eyes", "time"... are in the top words

2. Character analysis: How does the proportion of the three main characters change across the series / chapters? How does the proportion of other characters change across the series?

# Calculate the proportion of each word in each novel

words_prop
## # A tibble: 63,651 x 5
##                   title series     word     n proportion
##                   <chr>  <int>    <chr> <int>      <dbl>
##  1 order_of_the_phoenix      5    harry  3730 0.03854222
##  2       goblet_of_fire      4    harry  2936 0.04040571
##  3      deathly_hallows      7    harry  2770 0.03773533
##  4    half_blood_prince      6    harry  2581 0.04090462
##  5  prisoner_of_azkaban      3    harry  1824 0.04428474
##  6   chamber_of_secrets      2    harry  1503 0.04470420
##  7 order_of_the_phoenix      5 hermione  1220 0.01260630
##  8   philosophers_stone      1    harry  1213 0.04243484
##  9 order_of_the_phoenix      5      ron  1189 0.01228598
## 10      deathly_hallows      7 hermione  1077 0.01467183
## # ... with 63,641 more rows
# Calculate the words' proportion by chapters in each novel
words_prop_chapter
## # A tibble: 215,433 x 6
##                   title chapter series  word     n proportion
##                   <chr>   <int>  <int> <chr> <int>      <dbl>
##  1   chamber_of_secrets      19      2 harry   173 0.05158020
##  2       goblet_of_fire      31      4 harry   161 0.05439189
##  3       goblet_of_fire      26      4 harry   159 0.05033238
##  4  prisoner_of_azkaban      21      3 harry   153 0.05694083
##  5       goblet_of_fire      28      4 harry   152 0.05175349
##  6 order_of_the_phoenix      35      5 harry   149 0.04623022
##  7 order_of_the_phoenix      24      5 harry   147 0.04766537
##  8    half_blood_prince      18      6 harry   145 0.05228994
##  9       goblet_of_fire      20      4 harry   144 0.05801773
## 10       goblet_of_fire      23      4 harry   144 0.04551201
## # ... with 215,423 more rows
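The proportion columns above appear to be each word's count divided by the total number of counted words in the same book (or chapter). The sketch below shows that calculation on invented data. The choice of denominator is an assumption, since the original code is not shown, and `toy_words` and `toy_prop` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy table: one row per word occurrence
toy_words <- tibble(
  title = c(rep("book_a", 4), rep("book_b", 5)),
  word  = c("harry", "harry", "ron", "owl",
            "harry", "hermione", "hermione", "ron", "wand")
)

toy_prop <- toy_words %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%  # share of the book's counted words
  ungroup() %>%
  arrange(desc(n))

toy_prop
```

Adding `chapter` to both `count()` and `group_by()` would give the per-chapter version.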

** 2.1 How does the proportion of the three main characters change across the series? **

# Plot the proportion of the three main characters in each book. 

prop_book_graph

## The proportion of Harry and Ron decreases slightly over the series, while the proportion of Hermione increases slightly. In the second book (chamber_of_secrets) there is a relatively big gap between the proportions of Ron and Hermione.

** 2.2 How does the proportion of the three main characters change across the chapters in each book? **

# Draw line plots of each novel to compare the proportion change 

prop_chapter_graph

## Fans of Ron or Hermione can see in which chapters the character has a relatively high proportion. For example, in the first book (philosophers_stone), Ron and Hermione appear from the 6th chapter.

** 2.3 How does the proportion of other characters change across the series? **

other_prop

# The line plot shows that the proportion of Hagrid goes down over the series. Overall, the proportion of Dumbledore goes up from book 1 to book 6 and then drops in book 7.

3. Sentiment analysis

** 3.1 What are common joy words and sad words in the seven novels? **

# Extract joy words from sentiment dataset NRC.

nrcjoy
## # A tibble: 689 x 2
##             word sentiment
##            <chr>     <chr>
##  1    absolution       joy
##  2     abundance       joy
##  3      abundant       joy
##  4      accolade       joy
##  5 accompaniment       joy
##  6    accomplish       joy
##  7  accomplished       joy
##  8       achieve       joy
##  9   achievement       joy
## 10       acrobat       joy
## # ... with 679 more rows
# Use inner_join to perform the sentiment analysis.

joy
## # A tibble: 1,713 x 3
##                   title  joyword     n
##                   <chr>    <chr> <int>
##  1 order_of_the_phoenix ministry   191
##  2 order_of_the_phoenix    found   164
##  3 order_of_the_phoenix  feeling   145
##  4       goblet_of_fire  magical   129
##  5       goblet_of_fire ministry   115
##  6    half_blood_prince ministry   113
##  7       goblet_of_fire    found   108
##  8      deathly_hallows ministry    96
##  9    half_blood_prince    found    91
## 10      deathly_hallows    found    87
## # ... with 1,703 more rows
# In every novel, "found" is one of the most common joy words. "ministry", "magical", "hope" and "smile" are also frequently used joy words across the seven books.

joy_graph
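The `inner_join` step mentioned above keeps only the words that also appear in the sentiment word list, and the matches are then counted per book. The sketch below uses a three-word made-up lexicon, `toy_joy`, instead of the NRC joy list, so it runs without downloading NRC. All data and names in it are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical three-word stand-in for the NRC joy list
toy_joy <- tibble(word = c("found", "hope", "smile"), sentiment = "joy")

# Hypothetical tidy table: one row per word occurrence
toy_words <- tibble(
  title = c("book_a", "book_a", "book_a", "book_b", "book_b"),
  word  = c("found", "found", "wand", "hope", "owl")
)

toy_joy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%       # keep only words in the lexicon
  count(title, joyword = word, sort = TRUE)  # matches per book, most frequent first

toy_joy_counts
```

Words such as "wand" and "owl" drop out because they have no match in the lexicon.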

# Extract sad words from sentiment dataset NRC.

nrcsad
## # A tibble: 1,191 x 2
##           word sentiment
##          <chr>     <chr>
##  1     abandon   sadness
##  2   abandoned   sadness
##  3 abandonment   sadness
##  4   abduction   sadness
##  5    abortion   sadness
##  6    abortive   sadness
##  7     abscess   sadness
##  8     absence   sadness
##  9      absent   sadness
## 10    absentee   sadness
## # ... with 1,181 more rows
# Use inner_join to perform the sentiment analysis.
sad
## # A tibble: 2,559 x 3
##                   title sadword     n
##                   <chr>   <chr> <int>
##  1 order_of_the_phoenix   harry  3730
##  2       goblet_of_fire   harry  2936
##  3      deathly_hallows   harry  2770
##  4    half_blood_prince   harry  2581
##  5  prisoner_of_azkaban   harry  1824
##  6   chamber_of_secrets   harry  1503
##  7   philosophers_stone   harry  1213
##  8  prisoner_of_azkaban   black   332
##  9       goblet_of_fire   moody   309
## 10      deathly_hallows   death   305
## # ... with 2,549 more rows
# Setting aside "harry", which tops the table only because the name matches an entry in the NRC sadness list, the most common sad words in each novel are "black" and "dark". "kill", "bad", "leave" and "death" are also frequently used sad words across the seven books. The NRC results contain an oddity: "mother" appears in both the joy and the sad word lists.

sad_graph

# Check the word "mother" in the NRC lexicon. It is tagged with several different sentiments, so we should not take "mother" into account when analyzing sentiment here.

get_sentiments("nrc")%>%filter(word=="mother")
## # A tibble: 6 x 2
##     word    sentiment
##    <chr>        <chr>
## 1 mother anticipation
## 2 mother          joy
## 3 mother     negative
## 4 mother     positive
## 5 mother      sadness
## 6 mother        trust
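One possible way to leave such words out is to find every word tagged with both joy and sadness and remove it from the lexicon before joining. The sketch below shows the idea on a four-row invented lexicon, `toy_nrc`. It is an illustration under that assumption, not the approach used in this analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature lexicon in the same shape as get_sentiments("nrc")
toy_nrc <- tibble(
  word      = c("mother", "mother", "hope", "death"),
  sentiment = c("joy", "sadness", "joy", "sadness")
)

# Words tagged with both joy and sadness
ambiguous <- toy_nrc %>%
  filter(sentiment %in% c("joy", "sadness")) %>%
  group_by(word) %>%
  filter(n_distinct(sentiment) > 1) %>%
  ungroup() %>%
  distinct(word)

# Lexicon with the ambiguous words removed
toy_nrc_clean <- toy_nrc %>%
  anti_join(ambiguous, by = "word")

toy_nrc_clean
```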

** 3.2 How does the sentiment change across the series / chapters? Does it become more positive or negative? **

# 3.2.1 Compare the ratio of negative to positive words used in the seven books. A bigger ratio indicates more negative sentiment.

ratio_np

# The line graph shows that the ratio of negative to positive words fluctuates: a high ratio is usually followed by a relatively low ratio in the next book, except that the ratio for prisoner_of_azkaban is higher than for chamber_of_secrets.

# 3.2.2 How does the ratio change through chapters in each book?

ratio_chapter_np

# The line graphs for each book show that at the end of the story, the ratio of negative to positive words declines to a lower level, which means the story has a relatively "happy ending". The fluctuations in each book also show the ups and downs of the sentiment. For example, in half_blood_prince, there is a peak of negative sentiment in chapter 29.
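The ratio described in 3.2 can be computed by counting negative and positive matches per book and dividing one by the other. The sketch below uses a four-word invented lexicon, `toy_lexicon`, because the lexicon behind `ratio_np` is not stated. All names and data are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical lexicon with positive/negative labels
toy_lexicon <- tibble(
  word      = c("dark", "death", "happy", "love"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Hypothetical tidy table: one row per word occurrence
toy_words <- tibble(
  title = c(rep("book_a", 3), rep("book_b", 4)),
  word  = c("dark", "happy", "love", "dark", "death", "death", "happy")
)

toy_ratio <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(title, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(ratio = negative / positive)  # > 1 means more negative than positive words

toy_ratio
```

Adding `chapter` to `count()` gives the per-chapter ratio. A chapter with no positive matches would give a ratio of `Inf`, so that case may need separate handling.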

4. Examine how sentiment changes throughout each novel/chapter using sections

# Create a tidy text format that records the line number of each word.

series
## # A tibble: 409,485 x 5
##    chapter linenumber      word              title series
##      <int>      <int>     <chr>              <chr>  <int>
##  1       1          1       boy philosophers_stone      1
##  2       1          1     lived philosophers_stone      1
##  3       1          2   dursley philosophers_stone      1
##  4       1          2    privet philosophers_stone      1
##  5       1          2     drive philosophers_stone      1
##  6       1          2     proud philosophers_stone      1
##  7       1          2 perfectly philosophers_stone      1
##  8       1          2    normal philosophers_stone      1
##  9       1          3    people philosophers_stone      1
## 10       1          3    expect philosophers_stone      1
## # ... with 409,475 more rows
# Use Bing lexicon to analyze

series_bing

## Usually, there are more negative words in each section.


# Use AFINN lexicon to analyze

series_afinn

## The results seem more reasonable when using the AFINN lexicon.

# Take philosophers stone as an example to examine how sentiment changes throughout the chapter - bing

sentence_sent
## # A tibble: 141 x 5
##    chapter index negative positive sentiment
##      <int> <dbl>    <dbl>    <dbl>     <dbl>
##  1       1     0       16       13        -3
##  2       1     1       26       14       -12
##  3       1     2       11        9        -2
##  4       1     3       12        4        -8
##  5       1     4       15       13        -2
##  6       1     5       22       17        -5
##  7       1     6       14       20         6
##  8       1     7       14       20         6
##  9       2     0       16       12        -4
## 10       2     1       19       17        -2
## # ... with 131 more rows
stone_graph
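The `index` column in `sentence_sent` suggests that lines were grouped into fixed-size sections with integer division, and that net sentiment is positive minus negative within each section. The sketch below reproduces that pattern on invented data. The section size of 3 lines and the tiny lexicon are assumptions, since the values actually used are not shown.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical lexicon with positive/negative labels
toy_lexicon <- tibble(
  word      = c("dark", "death", "happy", "love"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Hypothetical tidy table with a line number for each word
toy_lines <- tibble(
  chapter    = 1L,
  linenumber = c(1, 1, 2, 3, 4, 5),
  word       = c("dark", "happy", "death", "love", "happy", "dark")
)

section_size <- 3  # assumed number of lines per section

toy_sections <- toy_lines %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(chapter, index = linenumber %/% section_size, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)  # net sentiment per section

toy_sections
```

A larger section size smooths the curve, and a smaller one shows more local swings.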

5. Using wordcloud to find the most common words in Harry Potter

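library(dplyr)      # count() and %>%
library(wordcloud)  # wordcloud()
# Word cloud of the 100 most frequent words across all seven books
sevenbook %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))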

# Across the seven books, the word cloud also shows that the main characters are "Harry", "Ron", "Hermione", "Dumbledore" and "Hagrid"...

** Find the most common positive and negative words **

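library(dplyr)
library(tidytext)   # get_sentiments()
library(reshape2)   # acast()
library(wordcloud)  # comparison.cloud()
# Compare the most common negative and positive words (Bing lexicon)
sevenbook %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 50)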
## Joining, by = "word"

# From the word cloud, we find that the most common positive words throughout the series are "magic", "top", "happy", "gold", "love", "nice"... And the most common negative words are "dark", "fell", "hard", "death"...

6. What is the relationship between words in Harry Potter? Create bigrams and analyze the relationship between words.

# Examine the most common bigrams

bigram_n
## # A tibble: 523,420 x 3
## # Groups:   title [7]
##                   title     bigram     n
##                   <chr>      <chr> <int>
##  1 order_of_the_phoenix     of the  1192
##  2      deathly_hallows     of the  1002
##  3       goblet_of_fire     of the   901
##  4 order_of_the_phoenix     in the   872
##  5    half_blood_prince     of the   707
##  6 order_of_the_phoenix said harry   689
##  7      deathly_hallows     in the   673
##  8       goblet_of_fire     in the   673
##  9 order_of_the_phoenix     at the   607
## 10 order_of_the_phoenix     on the   603
## # ... with 523,410 more rows
# The most common bigrams are pairs of stop words, which we are not interested in.
# Remove bigrams in which either word is a stop word

# new bigram counts:

bigram_counts
## # A tibble: 89,120 x 3
##           word1      word2     n
##           <chr>      <chr> <int>
##  1    professor mcgonagall   578
##  2        uncle     vernon   386
##  3        harry     potter   349
##  4        death     eaters   346
##  5        harry     looked   316
##  6        harry        ron   302
##  7         aunt    petunia   206
##  8 invisibility      cloak   192
##  9    professor  trelawney   177
## 10         dark       arts   176
## # ... with 89,110 more rows
# We can see that names are the most common pairs in the Harry Potter series.
# Harry and Ron usually appear together.
# Also, Ron and Hermione usually appear together.
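The bigram steps described above (tokenize into word pairs, split each pair, drop pairs containing a stop word, then count) can be sketched as follows. The example runs on a single made-up sentence pair rather than the books, and `toy_text`, `toy_bigrams` and `toy_united` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Hypothetical raw text for one book
toy_text <- tibble(
  title = "toy_book",
  text  = "Professor McGonagall looked at Harry Potter. Harry Potter looked at the owl."
)

toy_bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # consecutive word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%      # one column per word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                   # drop pairs with a stop word
  count(word1, word2, sort = TRUE)

# Re-join the two words into a single bigram column
toy_united <- toy_bigrams %>%
  unite(bigram, word1, word2, sep = " ")

toy_united
```

Note that bigrams are formed across sentence boundaries within a string, so "potter harry" also appears in the toy result.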
# Unite and analyze

bigrams_united
## # A tibble: 107,016 x 4
##                 title series               bigram     n
##                 <chr>  <int>                <chr> <int>
##  1 philosophers_stone      1         uncle vernon    97
##  2 philosophers_stone      1 professor mcgonagall    90
##  3 philosophers_stone      1         aunt petunia    52
##  4 philosophers_stone      1         harry potter    26
##  5 philosophers_stone      1         harry looked    22
##  6 philosophers_stone      1 professor dumbledore    20
##  7 philosophers_stone      1   professor quirrell    18
##  8 philosophers_stone      1     hermione granger    16
##  9 philosophers_stone      1         privet drive    16
## 10 philosophers_stone      1   professor flitwick    15
## # ... with 107,006 more rows
# Professor McGonagall is also a common character in Harry Potter. From the plot, we find that the frequency goes up in order_of_the_phoenix.

united_graph

# We find that in goblet_of_fire, Harry and Ron usually appear together. 

bigram_harry
## # A tibble: 8,566 x 4
##                   title series       bigram     n
##                   <chr>  <int>        <chr> <int>
##  1       goblet_of_fire      4    harry ron    86
##  2 order_of_the_phoenix      5 harry looked    76
##  3      deathly_hallows      7 harry looked    60
##  4       goblet_of_fire      4 harry looked    58
##  5 order_of_the_phoenix      5    harry ron    54
##  6  prisoner_of_azkaban      3    harry ron    49
##  7      deathly_hallows      7    harry ron    40
##  8    half_blood_prince      6 harry looked    36
##  9    half_blood_prince      6    harry ron    34
## 10   chamber_of_secrets      2 harry looked    33
## # ... with 8,556 more rows
# Analyze sentiment associated with Harry

harry_sentiment
## # A tibble: 447 x 3
##         word score     n
##        <chr> <int> <int>
##  1      yeah     1    47
##  2   reached     1    29
##  3      dear     2    26
##  4      lied    -2    24
##  5   laughed     1    22
##  6   feeling     1    21
##  7  bitterly    -2    19
##  8      fire    -2    19
##  9   stopped    -1    18
## 10 nervously    -2    17
## # ... with 437 more rows
harry_graph
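A table like `harry_sentiment` can be produced by keeping the bigrams whose first word is "harry" and joining the second word to a scored lexicon. The sketch below assumes that approach and uses a three-word made-up lexicon, `toy_afinn`, with a `score` column named as in the output above. The word pairs are invented, and the `contribution` column is an extra illustration rather than part of the original table.

```r
library(dplyr)
library(tibble)

# Hypothetical scored lexicon (three words only)
toy_afinn <- tibble(
  word  = c("laughed", "lied", "bitterly"),
  score = c(1L, -2L, -2L)
)

# Hypothetical separated bigrams
toy_pairs <- tibble(
  word1 = c("harry", "harry", "harry", "ron", "harry"),
  word2 = c("laughed", "laughed", "lied", "lied", "looked")
)

toy_harry <- toy_pairs %>%
  filter(word1 == "harry") %>%                      # words that follow "harry"
  inner_join(toy_afinn, by = c("word2" = "word")) %>%
  count(word = word2, score, sort = TRUE) %>%
  mutate(contribution = n * score)                  # frequency-weighted sentiment

toy_harry
```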

** network of bigrams **

bigram_graph
## IGRAPH f18185c DN-- 85 60 -- 
## + attr: name (v/c), n (e/n)
## + edges from f18185c (vertex names):
##  [1] professor   ->mcgonagall uncle       ->vernon    
##  [3] harry       ->potter     death       ->eaters    
##  [5] harry       ->looked     harry       ->ron       
##  [7] aunt        ->petunia    invisibility->cloak     
##  [9] professor   ->trelawney  dark        ->arts      
## [11] professor   ->umbridge   death       ->eater     
## [13] entrance    ->hall       madam       ->pomfrey   
## [15] dark        ->lord       professor   ->dumbledore
## + ... omitted several edges
network
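A graph object like `bigram_graph` can be built from a bigram count table with `igraph::graph_from_data_frame()`, which treats the first two columns as edge endpoints and the remaining columns as edge attributes. In the sketch below, the first four rows reuse counts from `bigram_counts` above, the last row is invented, and the cutoff of 100 is an assumption, since the threshold used for the 60-edge graph is not shown.

```r
library(dplyr)
library(tibble)
library(igraph)

# Bigram counts: first four rows taken from bigram_counts, last row invented
toy_counts <- tibble(
  word1 = c("professor", "uncle", "harry", "professor", "harry"),
  word2 = c("mcgonagall", "vernon", "potter", "trelawney", "sighed"),
  n     = c(578L, 386L, 349L, 177L, 3L)
)

toy_graph <- toy_counts %>%
  filter(n > 100) %>%        # assumed cutoff: keep only frequent pairs
  graph_from_data_frame()    # directed graph, n stored as an edge attribute

toy_graph
```

The resulting object can then be passed to a plotting package such as ggraph to draw the network.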